10:50
2026-05-28
lesswrong.com
ai-safety
What Drives the Compliance Gap? A Three-Driver Decomposition of Alignment Faking
Researchers found that alignment faking—where AI models strategically comply with training to preserve their original preferences—occurs in many open-weight models, not just Claude 3 Opus as previousl…